(54) Method and system for document storage management based on document content



(57) A document storage management system and method that manages the storage of documents based
upon the similarity of the content of the documents. Groups of documents are created based upon the
similarity of the contents of the documents. Those groups are displayed to the user in a ranked list of
selectable groups to permit selection of a group or document. The storage of the selected group or
document is then managed by, for example, deleting, compressing, or copying. The displayed list may be
ranked based upon a least recently used policy, the relevance to a predetermined topic, the size of the
group, the radius of the group based upon the maximum distance of any document from the group centroid,
the number of documents in the group and any other combination of parameters.
Description



[...] based on the content of the stored documents.

[0002] Information access plays a key role across a [...] range of leisure and work activities. Laptop,
hand-held and palm-top [...] appliances with the promise of [...] that make portable computers less
dependent [...]

[0003] A cache is generally defined as a fast access [...] A cache reduces the reliance of the computer
on a connection to [...] stored in a [...] documents even when disconnected from a network. Cache
management policy governs [...]

[0004] An important part of a cache management system is [...] which items will be removed from the
cache [...] space to the cache to store a new item. A replacement policy requires inherently difficult
[...] Such a decision involves predicting the future. The accuracy of those predictions [...] of the
replacement policy. [...] for information have been [...]

[...] When a computer is connected over a wireless network [...] expensive [...] relatively long time to
locate and [...] in local storage, there would be no need to download [...] the network connection usage
for the download. If the cache [...] avoided. Moreover, [...] investigating new replacement policies that
better predict [...]

[0006] One attempt to solve these problems [...] a computational replacement policy with direct [...]
user interacts. This is based on the assumption that [...] know which is the most important information
to keep in local storage. Two systems, TeleWeb and Mowgli, [...] that allow users to lock documents into
the local storage. TeleWeb and Mowgli are described in 'TeleWeb: Loosely Connected [...] to the World-Wide
Web', W.N. Schilit et al., Computer Networks and ISDN Systems, [...] and '[...] Optimizing World-Wide Web
for Weakly Connected [...]', [...] Proceedings of the 2nd [...] (SDNE [...]), June 5 [...], 1995,
respectively. [...] because it becomes filled with locked, but no longer relevant [...] documents are
appropriate to discard [...]

[...] of the dates of last access. The appropriate sorting is [...] column. One problem with this approach
is that [...] formulate a search query [...] designate [...] documents [...]

[...] Some storage management systems try to [...] Systems have been implemented that allow users [...]
according to [...] accesses. [...] 'Detection and Exploitation of [...] Sets' [...] in a File System',
James [...]
[...] incorporated herein by reference in their entireties. Traces of file accesses are used by these systems to hoard or prefetch files
into the local storage. More recently, a system described in 'Intelligent File Hoarding for Mobile Computers', C. Tait et
al., MOBICOM 95, pp. 119-125, ACM, Inc., incorporated herein by reference in its entirety, extended this concept to a
graphic interface that permits users to select which of a number of profiles to hoard. These profiles are created by
observing patterns of file accesses to both application files and data files. However, this style of storage management
only works well when users have recurring and predictable patterns of file accesses.

[0010] A number of systems have used automatic techniques to present groups of related documents to a user. One
technique characterizes clusters of documents with keywords and titles of representative documents in order to support
browsing or full-text retrieval. This technique is described in 'Scatter/Gather: A Cluster-Based Approach to Browsing
Large Document Collections', D.R. Cutting et al., Proceedings of the 15th Annual International ACM/SIGIR Conference,
1992, ACM Inc., and 'Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results', M.A. Hearst et al.,
Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich, 1996, incorporated herein by reference
in their entireties.
[0011] Another system presents clusters of Web documents with keyword pairs based on their intra-cluster similarity,
and allows the user to expand the contents of clusters. Such a system is described in 'Automatically Organizing Bookmarks
per Contents', Y.S. Maarek et al., Computer Networks and ISDN Systems 28, pp. 1321-1333, 1996, incorporated
herein by reference in its entirety. However, these systems do not provide facilities for deleting, compressing, or taking
other management actions on groups of documents.

[0012] None of the above systems works well for personal information appliances that provide access to information 
that cannot be neatly organized and that do not have a recurring access pattern. 

[0013] A document management system is needed that augments the advantages of direct user interaction with a
minimally invasive technique, does not fill up the cache with locked documents, and does not rely upon recurring access
patterns.

[0014] This invention provides a method and a system that allow users to manage storage space by choosing among
groups of documents having similar content. The content of the documents may be presented to the user in groups
identified by topics. Grouping documents allows the user to free up space quickly by avoiding expensive file-by-file
decisions. Grouping documents also permits the user to determine which files need to be stored locally, either for speed
of access or to avoid interruptions when subsequently operating in a disconnected state, based upon the topics for
which the user anticipates a future need. For example, Fig. 2 shows a GUI 20 according to one embodiment of this
invention. The GUI 20 presents three groups from which the user may choose. Each group has a topic that may
characterize the documents collected into that group. For example, as shown in Fig. 2, documents are collected into
three groups, a first group 22 entitled '95 Windows Microsoft NT network workstation', a second group 24 entitled
'mobile wireless network computing workstation system' and a third group 26 entitled 'developer HTML web CGI
Microsoft Program URL'. Thus, the first group 22 includes documents related to networking with Windows 95 and NT,
the second group 24 includes documents related to issues in mobile computing and wireless networks and the third
group 26 includes documents related to developing web applications.

[0015] The user can determine which files (i.e. documents) to remove, prefetch or hoard based upon these groups
of documents that have similar content. The user can quickly select a group of documents having similar, but no longer 
relevant content, delete that group, and free up a large amount of space by removing the entire group. The user may 
also anticipate which groups of documents will be used in the future and prefetch or hoard these documents from an 
external source into the local storage based upon the content of those documents. 

[0016] The user may rely solely upon the determination of the similarity of the content of the documents within the
groups or may additionally rely upon the determination of the date and time of last access, the relevance of the groups
to a predetermined topic, the total size of the group, the radius of the group, and/or the number of documents in the
groups. Additionally, the groups may be ranked by the date and time of last access to the documents in the group. The
groups may be, alternatively, ranked in accordance with any number or combination of group characteristics.
[0017] The grouping of documents can also be used as a guideline for making document storage decisions. For
example, a group may be selected out of a listing of groups and the selected group may be expanded into a list of
documents from the selected group. Preferably, the document list is ranked in accordance with predetermined or user
defined attributes. The user is then free to select one, several or all of the documents within the group for a storage
management process.

[0018] These and other features and advantages of this invention are described in or are apparent from the following
detailed description of the preferred embodiments.
[0019] The preferred embodiments of this invention will be described in detail, with reference to the following figures,

wherein:

Fig. 1 is a graphical user interface of a conventional document storage management system;

Fig. 2 is a graphical user interface showing documents grouped by the similarity of content of the documents in
[...]

[...]

Fig. 5 is a flow chart outlining a document similarity grouping routine of the invention; and
Fig. 6 is a block diagram of one embodiment of a processor of this invention.

[0020] Fig. 3 shows a block diagram of an information system 30 operating with the [...] storage systems
regardless of whether they are portable [...] that although this description generally refers only to
documents [...] files [...] objects. A "file" is intended to include any unit of information [...] file
or object that may be grouped with [...] Content is intended to include [...] document is intended [...]
in the local storage device 34 by selectively [...]

[...] As shown in Fig. 3, the system 30 is preferably implemented using a programmed general purpose [...]
communication interface 40 and the processor 38 to the memory 36 and [...] wired or wireless links to a
network (not shown). The network can be a local area network, a wide area network, an intranet, the
Internet, or any other distributed processing and [...] from a physically remote external data source 42.
[...] the entire electronic [...] the system 30. [...] of the control routine of the document management
[...]
[...] 42. The local documents are grouped when more storage space is needed in the cache 32 and the documents in the
external data source 42 are grouped when the information system 30 is loading more documents into the cache 32.
Grouping the documents relies upon procedures that are well known in the art. These procedures group documents
based upon the similarity of the content of the documents. An example of such a procedure is disclosed in "Automatic
Text Processing", by Gerard Salton, pp. 303-309, Addison-Wesley Publishing Company, Inc. (1989), which is incorporated
herein by reference in its entirety.

[0030] The specific method chosen for grouping the documents is not crucial, because grouping can take place in
the background. Therefore, the speed of the grouping method may not be important. The only requirement is that the
system groups the documents based upon the similarity of the content in the documents. It is to be understood that
the term "similarity" is intended to include any measure of a document or a portion of a document's relatedness or
relevance to another document or portion of a document. The similarity may be calculated, predetermined or determined
by a user. The documents may be grouped by any conventional clustering algorithm. Examples of clustering algorithms
include algorithms that cluster based on the size, radius of the document, the time of a document's creation, last edit
or last access or any other attribute of a document. Overlapping of groups may be allowed or prohibited. A specific
example of a grouping method will be described in detail below.

[0031] The control routine displays an ordered or ranked group listing at step S120. An example of a ranked group
list is shown in Fig. 2. The control routine then determines, at step S130, whether a group has been selected by the
user. If a group has been selected, the control routine continues to step S140. Otherwise, the control routine jumps to
step S190.

[0032] In step S140, the control system determines whether the user has requested that the selected group be
expanded. If the control routine determines that the user has requested that the selected group be expanded, then the
control routine continues to step S150. Otherwise, the control routine jumps to step S170. In step S150, the document
management system displays a ranked list of the documents within the selected group. Fig. 2 shows an expanded
listing 28 of the selected group 22.

[0033] If, at step S140, the control system determines that an expand group command has been received, then the 
control routine proceeds to step S150. At step S150 the control routine displays a ranked list of documents within the 
selected group. Each document within the group is selectable by a user. If the control system determines at step S160 
that a document has been selected, then the control routine proceeds to step S170. 

[0034] At step S170, the system determines whether a user has input a command to operate on the selected document
or group of documents. If at step S170, the system receives a command, then the control routine continues to
step S180. The command may include but is not limited to a command to delete the selected document or group of
documents from the cache or to store the selected document or group of documents from the external data source
into the cache. In step S180, the control routine executes the command on the selected document or group of documents.
The control routine then proceeds to step S190.

[0035] Alternatively, if at step S160, no document is selected, then the control routine continues to step S190. Similarly,
if at step S170, the control system does not receive a command, then the control routine also jumps to step S190.
In step S190, the control routine stops.
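
The branching among steps S120 to S190 can be sketched in Python as follows. This is only an illustrative sketch: the `Interaction` record, the `control_routine` function and the `execute` callback are hypothetical names that do not appear in the description, and the user's choices are passed in as data rather than read from a GUI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """Hypothetical record of the user's choices during one pass of the routine."""
    selected_group: Optional[str] = None     # checked at S130
    expand: bool = False                     # checked at S140
    selected_document: Optional[str] = None  # checked at S160
    command: Optional[str] = None            # checked at S170, e.g. "delete"


def control_routine(groups, ui, execute):
    """Run one pass of steps S120-S190 and return the list of steps visited.

    groups maps a group name to its list of documents; execute(command, docs)
    stands in for the storage management action performed at S180.
    """
    trace = ["S120", "S130"]          # S120: display the ranked group listing
    if ui.selected_group is None:     # S130: no group selected
        return trace + ["S190"]
    target = groups[ui.selected_group]
    trace.append("S140")
    if ui.expand:
        trace += ["S150", "S160"]     # S150: display ranked documents of the group
        if ui.selected_document is None:
            return trace + ["S190"]   # S160: no document selected
        target = [ui.selected_document]
    trace.append("S170")
    if ui.command is not None:
        trace.append("S180")          # S180: execute the command on the selection
        execute(ui.command, target)
    return trace + ["S190"]
```

With a group selected, no expansion and a "delete" command, the pass visits S120, S130, S140, S170, S180 and S190 in that order.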

[0036] In step S170, the user of the method and the system of this invention may choose from any number of storage
management commands. Examples of storage management commands include deletion, compression, and copying
of the selected document or group of documents.

[0037] One method for determining the similarity of documents comprises the method outlined in the flow chart of
Fig. 5, which begins in step S300. Next, in step S310, individual words occurring in the documents of a collection are
identified. Then, in step S320, a stop list of common function words ("and," "of," "or," "but," "the," and the like) is used
to delete high-frequency function words that are insufficiently specific to represent the content of the documents. Next,
in step S330, an automatic suffix-stripping routine is used to reduce each remaining word to word-stem form. This
routine reduces all words exhibiting the same stem to a common form. Control then continues to step S340.
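
Steps S310 to S330 can be illustrated with a short Python sketch. The stop list and the suffix table below are deliberately tiny, hypothetical stand-ins chosen for the example; the description does not name a particular stop list or suffix-stripping routine, and a real system would likely use a fuller stemmer.

```python
import re

# Illustrative stop list only; a practical list would be much longer.
STOP_WORDS = {"and", "of", "or", "but", "the", "a", "an", "in", "to", "is"}

# Crude stand-in for an automatic suffix-stripping routine (an assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_stems(text):
    """Return the word stems of a document, in order of occurrence."""
    words = re.findall(r"[a-z]+", text.lower())           # S310: identify words
    content = [w for w in words if w not in STOP_WORDS]   # S320: apply stop list
    return [strip_suffix(w) for w in content]             # S330: strip suffixes
```

For the text "The caching of cached documents and caches", this sketch yields the stems `cach`, `cach`, `document` and `cach`, so the three inflected forms collapse to one common form.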
[0038] Next, in step S340, a typical document similarity coefficient is obtained. A description of a document
similarity coefficient calculation of a method and system of one embodiment of this invention follows. For each remaining
word stem $T_j$ and document $D_i$, a weighting factor $w_{ij}$ is determined. This weighting factor includes in part the
term frequency and in part the inverse document-frequency for the term. For example, the weighting factor is
determined as:

$$w_{ij} = tf_{ij} \cdot \log(N / df_j) \qquad (1)$$
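
A minimal Python sketch of such a weighting follows. It assumes the common form in which the weight of a stem is its term frequency in the document multiplied by the logarithm of the collection size divided by the number of documents containing the stem; the function name `tfidf_weights` and the choice of the natural logarithm are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute one {stem: weight} mapping per document.

    docs is a list of documents, each given as a list of word stems.
    weight = term frequency * log(number of documents / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each stem occurs at least once.
    df = Counter(stem for doc in docs for stem in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({stem: tf[stem] * math.log(n_docs / df[stem]) for stem in tf})
    return weights
```

A stem that occurs in every document receives a weight of zero under this form, which reflects that it does not help distinguish one document from another.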

[0039] Then, document vectors are calculated. A document vector for each document $D_i$ is represented by the set
of word stems together with the corresponding weighting factors:

$$D_i = (T_1, w_{i1};\ T_2, w_{i2};\ \ldots;\ T_t, w_{it}); \qquad (2)$$

[0040] or

$$D_i = (d_{i1}, d_{i2}, \ldots, d_{it}), \qquad (3)$$

[...]

$$Sim(D_1, D_2) = \sum_{j=1}^{t} d_{1j}\, d_{2j}. \qquad (4)$$

[...] similarity factor of Eq. (4) is advantageously normalized:

$$Sim(D_1, D_2) = \frac{\sum_{j=1}^{t} d_{1j}\, d_{2j}}{\sqrt{\sum_{j=1}^{t} d_{1j}^{2} \cdot \sum_{j=1}^{t} d_{2j}^{2}}}. \qquad (5)$$
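
The normalized similarity of Eq. (5) is the cosine of the angle between two document vectors. A short Python sketch is given below; it assumes the vectors are held as sparse `{stem: weight}` dictionaries, and the function name `cosine_similarity` is chosen for the example.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse document vectors.

    d1 and d2 map word stems to weights; stems missing from a vector
    are treated as having weight zero.
    """
    dot = sum(w * d2.get(stem, 0.0) for stem, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an all-zero vector has no direction; treat as dissimilar
    return dot / (norm1 * norm2)
```

Documents that share no stems score 0, and documents whose weights are proportional score 1, regardless of document length.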



[0043] Eq. (5) represents the cosine of the angle between the document vectors considered as vectors in a space
of t dimensions, where t is the number of distinct terms in the [...] continues to step [...]

[0044] After the similarity coefficient is determined between document vectors in step [...], in step S350,
documents are ranked in decreasing order of similarity. Next, in [...] formed based on a similarity metric. For
example, the documents may be [...] document in its own group and iteratively merges groups. A group is merged
[...] only if the merger results in a group with the smallest possible radius. [...] remain as potential mergers
then the first group is examined to [...] group that includes the first group and a second group is examined to
determine the radius of a [...] the second group. If merging is possible without exceeding a radius limit, [...]
the second [...]
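
One way to read the merging procedure is as a bottom-up clustering that starts with each document in its own group and repeatedly performs the merger that yields the smallest radius, stopping once the best available merger would exceed a radius limit. The Python sketch below follows that reading. It is an interpretation rather than the routine of the description: it assumes dense document vectors, Euclidean distance to the centroid, and hypothetical names (`group_documents`, `radius_limit`).

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def radius(vectors):
    """Maximum distance of any vector from the centroid of the group."""
    c = centroid(vectors)
    return max(math.dist(v, c) for v in vectors)


def group_documents(vectors, radius_limit):
    """Group document indices bottom-up without exceeding radius_limit."""
    groups = [[i] for i in range(len(vectors))]  # each document in its own group

    def merged_radius(pair):
        members = groups[pair[0]] + groups[pair[1]]
        return radius([vectors[i] for i in members])

    while len(groups) > 1:
        # Pick the pair of groups whose merger has the smallest radius.
        best = min(combinations(range(len(groups)), 2), key=merged_radius)
        if merged_radius(best) > radius_limit:
            break  # no remaining merger fits within the limit
        merged = groups[best[0]] + groups[best[1]]
        groups = [g for k, g in enumerate(groups) if k not in best] + [merged]
    return groups
```

Because grouping can run in the background, the quadratic search over candidate pairs in this sketch is kept simple rather than fast.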

[0045] There are several attributes of documents or files that [...] groups of [...] candidates for removal.
Groups that take up more storage space are [...] relevance that has been removed is less likely to incur a [...]
a "narrow" group (one containing very similar documents) is a good candidate for [...] minimizes the risk that
users will inadvertently remove important [...] for a group does not resemble some outlying document that
happens to be [...] this invention [...] which groups to present to the user.

[...] LRU policy where a document's [...] been a long time since its last access, is the best candidate for
removal. The system of this invention sets the LRU time for a group of documents to be that of the most recently
accessed document.
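
The group-level LRU rule can be shown with a small Python sketch. The function names and the sample dates are hypothetical; the only behavior taken from the description is that a group's LRU time is that of its most recently accessed document.

```python
from datetime import datetime


def group_lru_time(last_access_times):
    """LRU time of a group: the access time of its most recently used document."""
    return max(last_access_times)


def rank_groups_by_lru(groups):
    """Return group names ordered from least to most recently used.

    groups maps a group name to the last-access datetimes of its documents.
    """
    return sorted(groups, key=lambda name: group_lru_time(groups[name]))


# Illustrative data: "mixed" holds the oldest document, but one recent
# access makes the whole group more recently used than "old".
example = {
    "old": [datetime(1997, 1, 1), datetime(1997, 3, 1)],
    "mixed": [datetime(1996, 1, 1), datetime(1997, 6, 1)],
}
```

Here `rank_groups_by_lru(example)` places "old" before "mixed", so a single recent access keeps an entire group away from the top of a removal list.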

[0047] The system may optionally have an evolving query that approximates the user's interests, or some other
mechanism for relevance feedback. A relevance feedback process can be used to improve query formulation while
altering the original queries in two substantial ways:

1. Terms present in previously retrieved documents that have been identified as relevant to the query are added
to the original query formulations; and

2. The weights of the original query terms are altered by replacing the inverse document-frequency portion of the
weights [...] retrieved relevant and non-relevant documents of the collection.
[0048] Assuming that the initial query is available in the form specified in a manner similar to Eq. (3), a reformulated
query will take the following form:

$$Q = \alpha\,(w_{q1}, w_{q2}, \ldots, w_{qt}) + \beta\,(w_{q,t+1}, w_{q,t+2}, \ldots, w_{q,t+n}); \qquad (6)$$

where

$\alpha$ and $\beta$ may take values between 0 and 1; and

the weights of the t initial terms may be generated by taking combined term-frequency and inverse-document
frequency [...] The weights of the newly added terms t+1 to t+n, on the other hand, may be determined as a
combined term-frequency and a term-relevance weight.
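
The reformulation of Eq. (6) can be sketched in Python as a scaling of two sets of term weights. The dictionary representation, the function name `reformulate_query` and the default values of the two coefficients are assumptions made for the example; how the individual weights are obtained is outside the sketch.

```python
def reformulate_query(original, added, alpha=1.0, beta=0.5):
    """Combine original and newly added query terms as in Eq. (6).

    original maps the t initial terms to their weights; added maps the n
    newly added terms to their weights. alpha scales the original terms
    and beta scales the added terms; both must lie between 0 and 1.
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie between 0 and 1")
    query = {term: alpha * weight for term, weight in original.items()}
    for term, weight in added.items():
        # Added terms are expected to be new; if one repeats an original
        # term, its contribution is summed (a choice made for this sketch).
        query[term] = query.get(term, 0.0) + beta * weight
    return query
```

Choosing a smaller beta keeps the query anchored to the user's original terms while still letting feedback terms contribute.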

[0050] With a relevance feedback mechanism the [...] The system can then base the relevance for a group of
documents on the document having the maximum relevance [...]

[0051] The system may optionally have a user controllable device, such as a GUI interface for setting [...]
size of the groups. This allows the user to decide how much benefit they require for each interaction with the
storage [...]

[0052] The system may also permit the user to determine the minimum or maximum width of groups that are to be
presented. To compute the width of a group, the system uses the radius, which is defined as the maximum distance
of any element from the centroid of the group. This metric is preferred to an intra-cluster similarity (e.g.,
[...] distance from the centroid) because it is more sensitive to outliers, and the outliers are the documents in
a group that are the least similar in content to the other documents.
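
The sensitivity of the radius to outliers can be checked numerically. The Python sketch below compares the radius with the mean distance from the centroid, which is used here as one assumed example of an averaged intra-cluster measure; the points, the Euclidean distance and the function names are illustrative only.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def radius(points):
    """Maximum distance of any point from the centroid."""
    c = centroid(points)
    return max(math.dist(p, c) for p in points)


def mean_distance(points):
    """Average distance from the centroid (an assumed averaged measure)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


tight = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]
with_outlier = tight + [[11.0, 11.0]]

# How much each measure grows when a single outlying document is added.
radius_growth = radius(with_outlier) / radius(tight)
mean_growth = mean_distance(with_outlier) / mean_distance(tight)
```

In this example the radius grows by a noticeably larger factor than the mean distance when the outlier is added, which is why a radius limit screens out groups containing a dissimilar document more reliably.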
[0053] The groups of documents may then be ranked in accordance with [...] described metrics or in accordance
with a group score calculated from these metrics. A score for each group can be expressed as:

$$\text{Group Score} = \alpha \cdot \text{LRU} + \beta \cdot (\text{relevance}) + \gamma \cdot (\text{total size}) + \delta \cdot (\text{cluster radius}) + \epsilon \cdot (\text{\# of documents}), \qquad (7)$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\epsilon$ are user defined constants.

[0054] When the user is [...] the cache, the groups with the lowest scores are presented [...]
higher scores. Alternatively, groups with the highest scores are presented first when the system is being used to select
documents to download from an external data source.
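
The group score and the two presentation orders can be sketched in Python as follows. The `Group` record, the function names and the sample coefficient values are hypothetical. In particular, the units of each metric and the signs of the constants are assumptions: the example uses negative constants for size and radius so that large groups score lower and therefore surface earlier as removal candidates.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    lru: float          # assumed: larger means more recently used
    relevance: float
    total_size: float   # assumed: kilobytes
    radius: float
    n_docs: int


def group_score(g, alpha, beta, gamma, delta, epsilon):
    """Weighted sum of the five group metrics, following Eq. (7)."""
    return (alpha * g.lru + beta * g.relevance + gamma * g.total_size
            + delta * g.radius + epsilon * g.n_docs)


def rank_groups(groups, coeffs, for_removal=True):
    """Order groups for presentation.

    Lowest scores come first when freeing cache space; highest scores
    come first when choosing documents to download.
    """
    return sorted(groups, key=lambda g: group_score(g, *coeffs),
                  reverse=not for_removal)


stale = Group("stale", lru=1.0, relevance=0.1, total_size=500, radius=0.2, n_docs=10)
fresh = Group("fresh", lru=9.0, relevance=0.9, total_size=50, radius=0.2, n_docs=3)
coeffs = (1.0, 1.0, -0.01, -1.0, 0.0)  # illustrative user-defined constants
```

With these values the "stale" group is listed first for removal, while the "fresh" group is listed first when selecting documents to download.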

[0055] The system of this invention presents the groups in a scrollable, explodable table 20 as shown in Fig. 2. For
each group, the user sees the rank 60, distinguishing keywords 62, the number of documents contained within that
group 64, the total size of the group (in kilobytes) 66 and a most recent time and date of access of the group 68. For
groups with only one document, the list of keywords may be replaced with the document title. Initially, the table 20 is
sorted by the rank 60, but the user may sort the table 20 by the size 66, the number of documents 64 or the time of
last access 68. The user changes the sort variable by selecting the heading of the respective column. When a user
explodes a group to a first level, the system displays the titles and attributes of the most representative documents 28.
When a user explodes the group further, all the documents in the group are displayed sorted by time of access 68.
Users may select entire groups or several files within a group for removal.

[0056] Fig. 6 shows a block diagram of one embodiment of the processor 38 or processing system of this invention.
As shown in Fig. 6, the processor 38 is implemented using a general purpose computer 90 comprising a controller 86,
a memory 88, a group generator 72, a keyword generator 74, a group score calculator 76, a group expander 78, a
document/group selector 80, a storage manager 82 and a group filter 84. These elements of the general purpose
computer 90 are interconnected by a bus 70. The group generator 72, the keyword generator 74, the group score
calculator 76, the group expander 78, the document/group selector 80, the storage manager 82 and the group filter
84, controlled by controller 86, are used to implement the flow charts of Figs. 4 and 5. It should be appreciated that
many other implementations of these elements will be apparent to those skilled in the art.

[0057] The group generator 72 implements the control routine outlined in Fig. 5. The keyword generator 74 generates
keywords for each group. The keywords may then be used to indicate the content of the file to a user viewing a group
listing. The keyword generator 74 may be a type of document summarizer. The group score calculator 76 generates a
group score. An example method for calculating a group score has been detailed above. The group score may be used
to rank the groups in a group listing. The group expander 78 expands a selected group into a list of documents contained
within the selected group. As detailed earlier, a user may then select individual documents on which to perform various
storage management functions. The document selection is performed using document/group selector 80. The storage
manager 82 performs various storage management functions on the selected document(s) or group(s). The storage
manager can perform many processes that are common to many conventional storage managers such as deleting,
moving, copying, or adding a pointer to a document, flagging a document for a later process or compressing the document.

[0058] While this invention has been described with the specific embodiments outlined above, many alternatives,
modifications and variations are apparent to those skilled in the art. Accordingly, the preferred embodiments described
above are illustrative and not limiting. Various changes may be made without departing from the spirit and scope of the
invention as defined in the following claims.



1. A document storage management system communicating with a document storage device, a plurality of documents
stored on the document storage device, the system comprising: 

a group generator that generates at least one group of documents from the plurality of documents based on 
the content of the plurality of documents; 

a selector that is capable of selecting one of the at least one group of documents; and 
a storage manager that manages storage of the selected group of documents. 

2. The system of claim 1, wherein the group generator generates the at least one group of documents based on the
similarity of the content of the plurality of documents. 

3. The system of claim 1 or claim 2, wherein the group generator generates at least one group of documents based
further on at least one attribute of each of the plurality of documents. 

4. A method for managing a document storage system, comprising: 

grouping a plurality of documents stored in a document storage device into a plurality of groups based on the 

content of the plurality of documents; 

selecting at least one of the plurality of groups; and 

managing the storage of the selected at least one of the plurality of groups. 

5. The method of claim 4, wherein the grouping is based on the similarity of the content of the plurality of documents. 

6. The method of claim 4 or claim 5, wherein the plurality of documents are grouped further based upon at least one 
attribute of each of the plurality of documents. 
7. A graphical user interface for managing the storage of a plurality of documents that are stored in a document
storage device, the interface comprising:

[...] on the display, the at least one selectable group identifier representing corresponding groups of
documents, wherein the groups of documents comprise the plurality of documents that are grouped based on the
content of the plurality of documents, wherein the at least one selectable group identifier is responsive to a
selection of the at least one selectable group identifier to select a corresponding at least one group of
documents; and

a storage manager responsive to a command to perform a storage management function on the selected at
least one group of documents.

8. The interface of claim 7, wherein the plurality of documents are grouped based on the similarity of the content of 
the plurality of documents. 

9. The interface of claim 7 or claim 8, wherein the plurality of documents are grouped further based on at least one 
attribute of each of the plurality of documents. 

10. The interface of any of claims 7 to 9, wherein the at least one selectable group identifier is displayed in a list that
is ordered based on at least one group characteristic. 